Value-free random exploration is linked to impulsivity

Deciding whether to forgo a good choice in favour of exploring a potentially more rewarding alternative is one of the most challenging arbitrations both in human reasoning and in artificial intelligence. Humans show substantial variability in their exploration, and theoretical (but only limited empirical) work has suggested that excessive exploration is a critical mechanism underlying the psychiatric dimension of impulsivity. In this registered report, we put these theories to the test using large online samples, dimensional analyses, and computational modelling. Capitalising on recent advances in disentangling distinct human exploration strategies, we not only demonstrate that impulsivity is associated with a specific form of exploration—value-free random exploration—but also explore links between exploration and other psychiatric dimensions.


Model descriptions
In this study, we use the models that were developed and validated for this task in our previous work 1 .
Here, we re-print these equations for completeness. For a summary of the parameters of each model cf. Supplementary Table 4. The value of each bandit $i$ is represented as a Gaussian distribution $\mathcal{N}(\mu_i, \sigma_s^2)$ with $\sigma_s^2 = 0.8$. Participants have prior beliefs about the bandits' values, which we assume to be Gaussian with mean $m_0$ (prior mean; free parameter) and uncertainty $\sigma_0$ (prior variance; free parameter).

Mean and variance update rules
At each time point $t$ at which a sample $x_{i,t}$ of one of the bandits is presented, the expected mean $m_{i,t}$ and precision $\tau_{i,t}$ are updated as

$$m_{i,t+1} = m_{i,t} + \frac{\tau_s}{\tau_{i,t} + \tau_s}\,\big(x_{i,t} - m_{i,t}\big), \qquad (1)$$

$$\tau_{i,t+1} = \tau_{i,t} + \tau_s, \qquad (2)$$

with $\tau_s = 1/\sigma_s^2$ the sampling precision, $\sigma_s^2 = 0.8$ the fixed sampling variance, $x_{i,t}$ the presented sample, $i$ the bandit and $t$ the time point. These update rules are equivalent to using a Kalman filter 2 in stationary bandits.
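For illustration, these update rules can be implemented in a few lines. The sketch below is ours (the names kalman_update, m, tau, TAU_S are not from the study's code) and assumes the fixed sampling variance of 0.8 given above.

```python
import numpy as np

SAMPLING_VARIANCE = 0.8          # fixed sampling variance sigma_s^2 from the task
TAU_S = 1.0 / SAMPLING_VARIANCE  # sampling precision tau_s

def kalman_update(m, tau, x):
    """One update of a bandit's posterior mean m and precision tau
    after observing sample x (equations (1) and (2) above)."""
    m_new = m + (TAU_S / (tau + TAU_S)) * (x - m)  # precision-weighted shift towards x
    tau_new = tau + TAU_S                          # precision grows with each sample
    return m_new, tau_new

# Example: start from a hypothetical prior and observe an apple of size 6.
m, tau = 5.0, 0.25
m, tau = kalman_update(m, tau, 6.0)
```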

Models
We examined three base models reflecting complex exploration strategies: the UCB model, the Thompson model and the hybrid model. The UCB model encompasses the UCB algorithm (capturing directed exploration) and a softmax choice function (capturing so-called value-based random exploration). The Thompson model reflects Thompson sampling (capturing an uncertainty-dependent exploration). The hybrid model captures the contribution of the UCB model and the Thompson model, essentially a mixture of the above. We computed three extensions of each model by either adding value-free random exploration $(c_\epsilon, c_\eta) = \{1, 0\}$, novelty exploration $(c_\epsilon, c_\eta) = \{0, 1\}$ or both heuristics $(c_\epsilon, c_\eta) = \{1, 1\}$. To make sure that the UCB models are not penalised for having more free parameters than the Thompson sampling models, we also examined the UCB models with a fixed inverse temperature parameter (β=1). This leads to a total of 16 models (see the labels on the x-axis in Supplementary Figure 5). A coefficient $c_\epsilon = 1$ indicates that an ϵ-greedy component was added to the decision rule, ensuring that once in a while (ϵ% of the time), an option other than the predicted one is selected. A coefficient $c_\eta = 1$ indicates that the novelty bonus η is added to the computation of the value of novel bandits; the Kronecker delta in front of this bonus ensures that it is only applied to the novel bandit. For each model, the probability of choosing a bandit is described below. Please note that these three complex models make relatively similar predictions in our task, and that our model selection is primarily targeted at establishing the presence of exploration heuristics (in addition to complex exploration strategies).
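To make the two heuristics concrete, the sketch below shows how an ϵ-greedy component and a novelty bonus can be layered on top of any base decision rule. This is our illustration (the function and variable names are hypothetical), not the study's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def choose(values, epsilon=0.0, eta=0.0, novel_idx=None):
    """Pick a bandit given per-bandit values.
    eta: novelty bonus, added only to the novel bandit (the Kronecker delta).
    epsilon: probability of a value-free random choice of a non-predicted option."""
    v = np.asarray(values, dtype=float).copy()
    if novel_idx is not None:
        v[novel_idx] += eta                  # c_eta = 1: add the novelty bonus
    predicted = int(np.argmax(v))
    if rng.random() < epsilon:               # c_eps = 1: epsilon-greedy component
        others = [i for i in range(len(v)) if i != predicted]
        return int(rng.choice(others))       # select an option other than predicted
    return predicted
```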

UCB model:
In this model, an information bonus γ is added to the expected mean $m_{i,t}$ of each option, scaling with the option's uncertainty $\sigma_{i,t}$; a softmax with inverse temperature β then maps these values onto choice probabilities.
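A standard way to write this rule, consistent with the parameters defined in this document (expected mean $m_{i,t}$, uncertainty $\sigma_{i,t}$, information bonus γ, inverse temperature β, and the optional novelty term with coefficient $c_\eta$), is the following sketch, not necessarily the exact published formula:

$$V_{i,t} = m_{i,t} + \gamma\,\sigma_{i,t} + c_\eta\,\eta\,\delta_{i,\mathrm{novel}}, \qquad P(a_t = i) = \frac{\exp(\beta V_{i,t})}{\sum_j \exp(\beta V_{j,t})}.$$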

Thompson sampling model:
In this model, the overall uncertainty can be seen as a more refined version of a decision temperature 3 . Choices are made by drawing a sample $\tilde{x}_{i,t} \sim \mathcal{N}(m_{i,t}, \sigma_{i,t}^2)$ from each bandit's posterior and selecting the bandit with the largest sample; the resulting choice probability can be written using $\Phi$, the multivariate Normal density function, and $A$, the matrix computing the pairwise differences for each bandit.
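Written out under the definitions above (a standard formulation of Thompson sampling, not necessarily the exact published equation):

$$\tilde{x}_{j,t} \sim \mathcal{N}\big(m_{j,t}, \sigma_{j,t}^2\big), \qquad P(a_t = i) = \Pr\big(\tilde{x}_{i,t} > \tilde{x}_{j,t} \;\; \forall\, j \neq i\big),$$

a probability that can be evaluated with the multivariate Normal density $\Phi$ applied to the pairwise differences computed by the matrix $A$.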
Hybrid model:
This model allows a combination of the UCB model and the Thompson model 3 .
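Given that $w$ is described below as the contribution of UCB vs Thompson, one natural reading of the combination is a probability mixture of the two decision rules (our reconstruction):

$$P(a_t = i) = w\,P_{\mathrm{UCB}}(a_t = i) + (1 - w)\,P_{\mathrm{Thompson}}(a_t = i).$$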

Parameter estimation
To fit the parameter values, we used the maximum a posteriori probability (MAP) estimate. The optimisation function used is fmincon in MATLAB. All parameters besides $m_0$ and $w$ were free to vary as a function of the horizon, as they capture different exploration forms: directed exploration (information bonus γ; UCB model), novelty exploration (novelty bonus η), random exploration (inverse temperature β; UCB model), uncertainty-directed exploration (prior variance $\sigma_0$; Thompson model) and value-free random exploration (ϵ-greedy parameter). The prior mean $m_0$ was fitted to both horizons together, as we did not expect the belief about how good a bandit is to depend on the horizon. The same holds for $w$, as we assume that the arbitration between the UCB model and the Thompson model does not depend on the horizon.
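As a rough Python analogue of this MAP objective (fmincon being MATLAB-specific; the Gaussian prior below is purely illustrative, not the prior used in the study):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_posterior(params, data, log_likelihood):
    """MAP objective: minimise -(log likelihood + log prior)."""
    nll = -log_likelihood(params, data)               # model-specific choice likelihood
    log_prior = norm.logpdf(params, 0.0, 5.0).sum()   # illustrative Gaussian prior
    return nll - log_prior

# result = minimize(neg_log_posterior, x0, args=(data, ll_fn),
#                   bounds=bounds, method="L-BFGS-B")  # bounded, like fmincon
```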

Model comparison
Model comparison was performed using the Bayesian Information Criterion 4 (BIC). We computed the mean BIC score across participants and the number of participants best fit by each model. Additionally, we used a Bayesian model selection framework and computed the exceedance probability of each model 5 .
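For reference, the BIC penalises the maximised log likelihood by model complexity; a minimal helper (ours, for illustration):

```python
import numpy as np

def bic(n_params: int, n_obs: int, max_log_lik: float) -> float:
    """BIC = k*ln(n) - 2*ln(L_hat); lower scores indicate better models."""
    return n_params * np.log(n_obs) - 2.0 * max_log_lik
```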

Parameter recovery
For each parameter of the winning model (Thompson+ϵ+η from the pilot data), we sampled parameter values from a normal distribution defined by the pilot data mean and standard deviation, which we used to simulate behaviour. This was performed N=1000 times. For each simulation, we fitted the model and analysed the correlation between the simulated parameters and the fitted parameters (for the confusion matrix cf. Supplementary Figure 6a).
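A schematic of this recovery loop (our sketch; `simulate` and `fit` stand in for the task simulation and the MAP fitting described above):

```python
import numpy as np

def parameter_recovery(simulate, fit, means, sds, n_sims=1000, seed=0):
    """Sample generative parameters, simulate behaviour, re-fit the model,
    and correlate simulated with fitted values for each parameter."""
    rng = np.random.default_rng(seed)
    true = rng.normal(means, sds, size=(n_sims, len(means)))
    fitted = np.array([fit(simulate(p)) for p in true])
    return [np.corrcoef(true[:, k], fitted[:, k])[0, 1]
            for k in range(len(means))]
```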

Model identification
For each model, behaviour was simulated N=100 times with parameter values sampled from the pilot data mean and standard deviation. All models were fitted to this simulated data and BIC scores compared. The percentage of how often (out of the N=100 simulations) each fitted model won was computed (i.e., confusion matrix, cf. Supplementary Figure 6b and Supplementary Figure 19b). Additionally, the inversion matrix 6 was computed (cf. Supplementary Figure 6c). Please note that key to our model comparison is to assess the benefit of the two exploration heuristics (novelty exploration, value-free random exploration). We expected a degree of trade-off between the different complex models, as they make relatively similar predictions.
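The inversion matrix follows from the confusion matrix by Bayes' rule; a minimal sketch, assuming a uniform prior over the simulated models:

```python
import numpy as np

def inversion_matrix(confusion):
    """confusion[i, j] = P(model j wins | data simulated from model i).
    Returns P(data came from model i | model j won), assuming each model
    was simulated equally often (uniform prior): normalise each column."""
    C = np.asarray(confusion, dtype=float)
    return C / C.sum(axis=0, keepdims=True)
```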

Supplementary Note 1: Main experiment
The analyses mentioned below were not preregistered. The first part comprises analyses that were performed on pilot data in the Stage 1 report; the second part comprises new analyses (i.e., analyses added after data collection).
Additional analyses (part of the Stage 1 report)

Further analysis of the high-value bandit frequency
The horizon effect on the frequency of picking the high-value bandit was independent of whether the high-value bandit had 3 initial samples (V=153594.5).

Horizon effect on the certain-standard and standard bandits
The frequency of picking the certain-standard bandit was increased in the long versus the short horizon (V=157616.5, p<.001, r=0.797; Supplementary Figure 9a). Similarly, the frequency of picking the standard bandit was increased in the long horizon (V=128795.5, p<.001, r=0.543; Supplementary Figure 9b).

Supplementary Table. Reward breakdown (cf. Supplementary Figure 3a-b): the reward associated with the short horizon sample, the long horizon 1st sample, and the average long horizon sample (2nd-4th row; cf. Supplementary Figure 3c). The rewards are broken down according to which 3 out of 4 bandits were shown on these trials, and according to which bandit was the high-value bandit. Data are given as mean (std) [range].

Supplementary Figure 5. Each of the 16 columns indicates a model. The three 'main models' studied were the Thompson model, the UCB model and a hybrid of both. Variants were then created by adding the ϵ-greedy parameter, the novelty bonus and a combination of both. To make sure that the UCB models are not penalised for having more free parameters than the Thompson sampling models, we also included the UCB models with a fixed inverse temperature parameter (β=1), leading to a total of 16 models (see the labels on the x-axis). All parameters besides $m_0$ and $w$ were fitted to each horizon separately. Parameters: $m_0$ = prior mean (initial estimate of a bandit's mean); $\sigma_0$ = prior variance (uncertainty about $m_0$); $w$ = contribution of UCB vs Thompson; γ = information bonus; β = softmax inverse temperature; ϵ = ϵ-greedy parameter (stochasticity); η = novelty bonus.


Supplementary Table 6. Detailed correlations between measures of value-free random exploration and subscales of impulsivity questionnaires.
Bivariate and partial Pearson's correlations (r), Bonferroni-corrected (pcor) and uncorrected (punc) p values. ASRS: Adult ADHD Self-Report Scale; BIS: Barratt Impulsiveness Scale.

Supplementary Figure 1. Visualisation of the 9 different sizes that the apples can take. The associated rewards range from 2 (small apple on the right) to 10 (big apple on the left); the reward scales linearly with the radius of the apple.

Supplementary Figure 3. Benefits of exploration (N=580). Effect of information on performance. (a)
The first bandit participants chose as a function of its expected value (average of its initial samples). Participants chose bandits with a lower expected value (i.e., they exploited less) in the long horizon compared to the short horizon (two-sided Wilcoxon signed-rank test: V=110057, p=1.627e-10, Wilcoxon effect size r=0.265). (b) The first bandit participants chose as a function of the number of samples that were initially revealed. Participants chose less known (i.e., more informative) bandits in the long compared to the short horizon (V=160109.5, p=9.087e-82, r=0.796). (c) The first draw in the long horizon led to a lower reward than the first draw in the short horizon (V=131612, p=2.306e-37, r=0.53), indicating that participants sacrificed larger initial outcomes for the benefit of more information. This additional information helped participants make better decisions in the long run, leading to higher earnings over all draws in the long horizon (V=264, p=4.

Participant recruitment
To take part in the study, participants had to be above 18 years of age and have their current residence in the UK. To ensure data quality, participants were excluded according to the following criteria: their data was incomplete; their mean score (i.e., apple size) was lower than 5.5, indicating that they were performing at chance level (cf. Supplementary Figure 13b); their mean reaction time on the first draw was faster than 1500ms, indicating that they were not allocating much thought to their choice (cf. Supplementary Figure 13c); or they failed at least one attention check during the questionnaires, meaning that they were not reading the questions. According to these exclusion criteria, N=3 participants were excluded from the pilot data (cf. Supplementary Figure 13).
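Applied to a participant table, these criteria amount to a simple filter. The sketch below is hypothetical (the column names are ours) and mirrors the thresholds stated above:

```python
import pandas as pd

def apply_exclusions(df: pd.DataFrame) -> pd.DataFrame:
    """Keep participants passing all data-quality criteria."""
    keep = (
        df["complete"]                     # complete data
        & (df["mean_score"] >= 5.5)        # above chance-level performance
        & (df["first_rt_ms"] >= 1500)      # first-draw RT not faster than 1500 ms
        & (df["failed_checks"] == 0)       # passed every attention check
    )
    return df[keep]
```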

Participants use exploration beneficially
To evaluate whether participants were able to use exploration beneficially, we looked at their performance (i.e., the outcomes they obtained). We first compared the reward (i.e., apple size) obtained in the short horizon with the first reward obtained in the long horizon. The latter was lower, indicating that participants sacrificed larger initial outcomes for the benefit of gathering more information.

Participants explore using heuristics
To formally assess which exploration strategies are being used, we turned to computational modelling. As in the behavioural analysis, only the first draw of each trial was analysed. We compared 16 models that make different predictions about the usage of exploration strategies (cf. Supplementary Information, Model Descriptions). Participants used a mixture of computationally demanding (i.e., Thompson sampling and/or UCB) and heuristic exploration strategies (i.e., value-free random exploration and novelty exploration), captured by the winning model (Thompson+ϵ+η; for BIC scores on the pilot data cf. Supplementary Figure 5).

Participants rely more on heuristics in the long horizon
To assess the changes in exploration strategy, we examined the winning model's fitted parameters. Those parameters were fitted to the first draw of all trials of each participant. The ϵ-greedy parameter, which captures the contribution of value-free random exploration, was increased in the long (versus short) horizon (pilot data: t(60)=-3.

Block and trial effect on the high-value bandit
In addition to the horizon condition, when adding block as a within-participant factor in the repeated-measures ANOVA, there was an additional effect of block on the high-value bandit frequency. Similarly, when analysing the frequency of picking the high-value bandit per trial, a decrease was observed in the short horizon (linear regression slope vs null slope: t(60)=-4.38, p<.001) and in the long horizon (t(60)=-5.04, p<.001).

Block and trial effect on the novelty bandit
In addition to the horizon condition, when adding block as a within-participant factor in the repeated-measures ANOVA, there was an additional effect of block on the novelty bandit frequency. Similarly, when analysing the frequency of picking the novel bandit per trial, an increase was observed in the short horizon (linear regression slope vs null slope: t(60)=5.88, p<.001) and in the long horizon (t(60)=6.66, p<.001).

Additional block-dependent novelty parameter
Given the observed minor change in the novelty bandit frequency across blocks and trials (cf. above), we extended our model comparison with a model comprising a block-dependent quantity of the novelty bonus, which we named $\eta_B$. This model performed similarly to our winning model (Thompson sampling+ϵ+η) when looking at the average BIC score. Importantly, adding such a parameter did not affect our main parameter of interest ϵ (correlation between ϵ from the Thompson sampling+ϵ+η model and ϵ from the Thompson sampling+ϵ+η+$\eta_B$ model: short horizon: r=1, p<.001; long horizon: r=.99, p<.001).

No evidence of meta-learning on number of initial samples
We found no evidence that participants preferred the certain-standard bandit over the standard bandit, neither in the short horizon (linear regression between [certain-standard bandit frequency minus standard bandit frequency] and [trial]: slope vs null slope: t(60)=0.516, p=.608) nor in the long horizon (t(60)=-0.794, p=.43). Moreover, we also found no evidence that any 1-sample bandit (standard or low-value bandit) was chosen less than the 3-sample bandit over time, neither in the short horizon (linear regression between [3-sample bandit frequency minus 1-sample bandit frequency] and [trial]: slope vs null slope: t(60)=0.85, p=.399), nor in the long horizon (t(60)=-0.326, p=.746).

No difference in bandit colour occurrence
When performing an ANOVA with the participant identifier colour (8 sets of 3 different colours, i.e., 24 different colours) as between-participant factor and bandit as within-participant factor, there was no evidence of a difference in bandit occurrence for each colour (bandit main effect: F(2.34,53.91)=0, p=1, pes=0), meaning that each type of bandit is shown in every combination equally often.

Supplementary Table. Reward breakdown (pilot data; cf. Supplementary Figure 14a-b): the reward associated with the short horizon sample, the long horizon 1st sample, and the average long horizon sample (2nd-4th row; cf. Supplementary Figure 14c). The rewards are broken down according to which 3 out of 4 bandits were shown on these trials, and according to which bandit was the high-value bandit (highest mean).

Supplementary Table 22. High-value bandit information (pilot data; N=61). Percentage of trials where the certain-standard bandit's 3 initial samples are larger than the initial sample of the standard bandit, and vice versa, in the conditions where both bandits are present. Statistics demonstrate that the two cases occur equally often.

Supplementary Figure 13. Participants are excluded if (b) their total score (average apple size) was lower than 5.5 (horizontal line) and if (c) their reaction time on the 1st choice was lower than 1500ms (horizontal line). According to those exclusion criteria, N=3 participants out of the N=64 pilot participants were excluded from further analyses. Data are shown as mean ± 95%CI and each dot represents one participant.

Supplementary Figure 14. Benefits of exploration (pilot data; N=61). Effect of information on performance. (a) The first bandit participants chose as a function of its expected value (average of its initial samples). Participants chose bandits with a lower expected value (i.e., they exploited less) in the long horizon compared to the short horizon. (b) The first bandit participants chose as a function of the number of samples that were initially revealed. Participants chose less known (i.e., more informative) bandits in the long compared to the short horizon. (c) The first draw in the long horizon led to a lower reward than the first draw in the short horizon, indicating that participants sacrificed larger initial outcomes for the benefit of more information. This additional information helped participants make better decisions in the long run, leading to higher earnings over all draws in the long horizon. *** p<.001. Data are shown as mean ± 95%CI and each dot/line represents one participant.

Supplementary Figure 19. (a) Parameter recovery. For each parameter of the winning model (Thompson+ϵ+η from the pilot data), we sampled parameter values from a normal distribution defined by the pilot data mean and standard deviation, which we used to simulate behaviour (for additional simulations cf. Fig. S13). This was performed N=20000 times. For each simulation, we fitted the model and computed the Pearson correlation r between the simulated parameters and the fitted parameters. (b) Model identification. For each model, behaviour was simulated N=100 times with parameter values sampled from the pilot data mean and standard deviation. All models were fitted to this simulated data and BIC scores compared. The percentage p of how often each fitted model won was computed. The lower recovery of the full-UCB model likely reflects the conservative nature of the BIC, which punishes its high complexity to the advantage of the simpler (but equally versatile) full-Thompson model.